Abstract: We present a neural network architecture and
training method designed to enable very rapid training and low 
implementation complexity. Due to its training speed and very 
few tunable parameters, the method has strong potential for 
applications requiring frequent retraining or online training. The 
approach is characterized by (a) convolutional filters based on 
biologically inspired visual processing filters, (b) randomly-valued 
classifier-stage input weights, (c) use of least squares regression to 
train the classifier output weights in a single batch, and (d) linear 
classifier-stage output units. We demonstrate the efficacy of the 
method by applying it to image classification. Our results match 
existing state-of-the-art results on the MNIST (0.37% error) 
and NORB-small (2.2% error) image classification databases, 
but with very fast training times compared to standard deep 
network approaches. The network’s performance on the Google 
Street View House Number (SVHN) (4% error) database is also 
competitive with state-of-the-art methods.

I. Introduction

State-of-the-art performance on many image classification 
databases has been achieved recently using multilayered (i.e., 
deep) neural networks [1]. Such performance generally relies 
on a convolutional feature extraction stage to obtain invariance 
to translations, rotations and scale [2-5]. Training of deep 
networks, however, often requires significant resources, in 
terms of time, memory and computing power (e.g. in the order 
of hours on GPU clusters). Tasks that require online learning, 
or periodic replacement of all network weights based on fresh 
data may thus not be able to benefit from deep learning 
techniques. It is desirable, therefore, to seek very rapid training 
methods, even if this is potentially at the expense of a small 
performance decrease. 

Recent work has shown that good performance on image 
classification tasks can be achieved in ‘shallow’ convolutional 
networks—neural architectures containing a single training 
layer—provided sufficiently many features are extracted [3]. 
Perhaps surprisingly, such performance arises even with the 
use of entirely random convolutional filters or filters based on 
randomly selected patches from training images [4]. Although 
application of a relatively large number of filters is common
(followed by spatial image smoothing and downsampling), 
good classification performance can also be obtained with 
a sparse feature representation (i.e. relatively few filters and 
minimal downsampling) [5]. 

Based on these insights and the goal of devising a fast
training method, we introduce a method for combining several
existing general techniques into what is equivalent to a
five-layer neural network (see Figure 1) with only a single trained
layer (the output layer), and show that the method:

1) produces state-of-the-art results on well known image 
classification databases; 

2) is trainable in times in the order of minutes (up 
to several hours for large training sets) on standard 
desktop/laptop computers; 

3) is sufficiently versatile that the same hyper-parameter 
sets can be applied to different datasets and still 
produce results comparable to dataset-specific
optimisation of hyper-parameters.

The fast training method we use has been developed
independently several times [6-9] and has gained increasing
recognition in recent years; see [10-13] for recent reviews
of the different contexts and applications. The network
architecture in the classification stage is that of a three-layer
neural network composed of an input layer, a hidden layer
of nonlinear units, and a linear output layer. The input weights
are randomly chosen and untrained, and the output weights are
trained in a single batch using least squares regression. Due to
the convexity of the objective function, this method ensures the
output weights are optimally chosen for a given set of random
input weights. The rapid speed of training is due to the fact that
the least squares optimisation problem can be solved using an
$O(KM^2)$ algorithm, where $M$ is the number of hidden units
and $K$ the number of training points [14].

When applied to pixel-level features, these networks can
be trained as discriminative classifiers and produce excellent
results on simple image databases [14-19] but poor
performance on more difficult ones. To our knowledge, however, the
method has not yet been applied to convolutional features.

Therefore, we have devised a network architecture (see
Figure 1) that consists of three key elements that work
together to ensure fast learning and good classification
performance: namely, the use of (a) convolutional feature extraction,
(b) random-valued input weights for classification, (c) least
squares training of output weights that feed into (d) linear
output units. We apply our network to several image classification
databases, including MNIST [20], CIFAR-10 [21], Google
Street View House Numbers (SVHN) [22] and NORB [23].
The network produces state-of-the-art classification results on
the MNIST and NORB-small databases and near state-of-the-art
performance on SVHN.



These promising results are presented in this paper to
demonstrate the potential benefits of the method; clearly
further innovations within the method are required if it is to be
competitive on harder datasets like CIFAR-10 or Imagenet.
We expect that the most likely avenues for improving our
presented results for CIFAR-10, whilst retaining the method's
core attributes, are (1) introducing limited training of the Stage
1 filters by generalizing the method of [17]; and (2) introducing
training data augmentation. We aim to pursue these
directions in our future work.

The remainder of the paper is organized as follows.
Section II contains a generic description of the network
architecture and the algorithms we use for obtaining convolutional
features and classifying inputs based on them. Section III
describes how the generic architecture and training algorithms
are specifically applied to four well-known benchmark image
classification datasets. Next, Section IV describes the results
we obtained for these datasets, and finally the paper concludes
with discussion and remarks in Section V.

II. Network Architecture and Training Algorithms

The overall network is shown in Figure 1. There are 
three hidden layers with nonlinear units, and four layers of 
weights. The first layer of weights is the convolutional filter 
layer. The second layer is a pooling (low pass filtering) and 
downsampling layer. The third layer is a random projection 
layer. The fourth layer is the only trained layer. The output 
layer has linear units. 

The network can be conceptually divided into two stages 
and two algorithms, that to our knowledge have not previously 
been combined. The first stage is the convolutional feature 
extraction stage, and largely follows that of existing approaches 
to image classification [3-5,25]. The second stage is the 
classifier stage, and largely follows the approach of [14,19]. 
We now describe the two stages in detail. 

A. Stage 1 Architecture: Convolutional filtering and pooling 

The algorithm we apply to extract features from images
(including those with multiple channels) is summarised in
Algorithm 1. Note that the details of the filters $h_{i,c}$ and
$h_p$ described in Algorithm 1 are given in Section III-B, but
here we introduce the size of these two-dimensional filters as
$W \times W$ and $Q \times Q$. The functions $g_1(\cdot)$ and $g_2(\cdot)$ are nonlinear
transformations applied termwise to matrix inputs to produce
matrix outputs of the same size. The symbol $*$ represents
two-dimensional convolution.

The sequence of steps in Algorithm 1 suggests looping
over all images and channels sequentially. However, the
following mathematical formulation of the algorithm indicates that a
standard layered neural network formulation is applicable,
as shown in Figure 1, and therefore that all features
($f_k$, $k = 1, \ldots, K$) can be computed in
one shot from a $K$-column matrix containing a batch of $K$
training points.

Input : Set of $K$ images, $x_k$, each with $C$ channels
Output: Feature vectors, $f_k$, $k = 1, \ldots, K$

foreach $x_k$ do
    split $x_k$ into channels, $x_{k,c}$, $c = 1, \ldots, C$
    foreach $i = 1, \ldots, P$ filters do
        Apply filter to each channel: $y_{i,k,c} \leftarrow h_{i,c} * x_{k,c}$
        Apply termwise nonlinearity: $z_{i,k,c} \leftarrow g_1(y_{i,k,c})$
        Apply lowpass filter: $w_{i,k,c} \leftarrow h_p * z_{i,k,c}$
        Apply termwise nonlinearity: $s_{i,k,c} \leftarrow g_2(w_{i,k,c})$
        Downsample: $\hat{s}_{i,k,c} \leftarrow s_{i,k,c}$
        Concatenate channels: $v_{i,k} \leftarrow [\hat{s}_{i,k,1} | \cdots | \hat{s}_{i,k,C}]$
        Normalize: $f_{i,k} =$
    end
    Concatenate over filters: $f_k \leftarrow [f_{1,k} | f_{2,k} | \cdots | f_{P,k}]$
end

Algorithm 1: Convolutional feature detection.

The key to this formulation is to note that since convolution
is a linear operator, a matrix can be constructed that when
multiplied by a data matrix produces the same result as
convolution applied to one instance of the data. Hence, for
a total of $L$ features per image, we introduce the following
matrices.

Let 

• $F$ be a feature matrix of size $L \times K$;

• $X$ be a data matrix with $K$ columns;

• $W_{\mathrm{Filter}}$ be a concatenation of the $CP$ convolution
matrices corresponding to $i = 1, \ldots, P$, $c = 1, \ldots, C$;

• $W_0$ be a convolution matrix corresponding to $h_p$, that
also downsamples by a factor of $D$;

• $W_{\mathrm{Pool}}$ be a block diagonal matrix containing $CP$
copies of $W_0$ on the diagonal.

The entire flow described in Algorithm 1 can be written
mathematically as

$$F = g_2\big(W_{\mathrm{Pool}}\, g_1(W_{\mathrm{Filter}} X)\big), \qquad (1)$$

where $g_1(\cdot)$ and $g_2(\cdot)$ are applied term by term to all elements
of their arguments. The matrices $W_{\mathrm{Filter}}$ and $W_{\mathrm{Pool}}$ are sparse
Toeplitz matrices. In practice we would not form them directly,
but instead form one pooling matrix and one filtering matrix
for each filter, and sequentially apply each filter to the entire
data matrix, $X$.
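The equivalence between convolution and multiplication by a convolution matrix can be checked numerically. The following is a minimal sketch, assuming square images and filters and using a small dense matrix for clarity (the paper uses sparse matrices); the helper names `valid_conv2d` and `conv_matrix` are illustrative and not taken from the original implementation.

```python
import numpy as np

def valid_conv2d(image, kernel):
    """Direct 'valid' 2-D convolution (kernel flipped, no padding)."""
    J, W = image.shape[0], kernel.shape[0]
    flipped = kernel[::-1, ::-1]
    n_out = J - W + 1
    out = np.empty((n_out, n_out))
    for r in range(n_out):
        for c in range(n_out):
            out[r, c] = np.sum(image[r:r + W, c:c + W] * flipped)
    return out

def conv_matrix(kernel, J):
    """Dense matrix T with T @ image.ravel() == valid_conv2d(image, kernel).ravel()."""
    W = kernel.shape[0]
    n_out = J - W + 1
    flipped = kernel[::-1, ::-1]
    T = np.zeros((n_out * n_out, J * J))
    for r in range(n_out):
        for c in range(n_out):
            row = r * n_out + c
            for u in range(W):
                for v in range(W):
                    T[row, (r + u) * J + (c + v)] = flipped[u, v]
    return T

rng = np.random.default_rng(0)
J, W, K = 8, 3, 5                      # toy sizes, chosen only for the demo
kernel = rng.standard_normal((W, W))
X = rng.standard_normal((J * J, K))    # one flattened image per column
T = conv_matrix(kernel, J)
Y = T @ X                              # filters all K images in one product
direct = np.stack(
    [valid_conv2d(X[:, k].reshape(J, J), kernel).ravel() for k in range(K)],
    axis=1,
)
max_abs_diff = np.max(np.abs(Y - direct))
```

Each row of `T` holds the flipped filter at one output position, which is why the matrix is sparse for realistic image sizes.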

We use a particular form for the nonlinear hidden-unit
functions $g_1(\cdot)$ and $g_2(\cdot)$ inspired by LP-pooling [25], which
is of the form $g_1(u) = u^p$ and $g_2(v) = v^{1/p}$. For example, with
$p = 2$ we have

$$F = \sqrt{W_{\mathrm{Pool}}\,(W_{\mathrm{Filter}} X)^2}. \qquad (2)$$


Fig. 1. Overall network architecture. In total there are three hidden layers, plus an input layer and a linear output layer. There are two main stages: a
convolutional filtering and pooling stage, and a classification stage. Only the final layer of weights, $W_{\mathrm{out}}$, is learnt, and this is achieved in a single batch using
least squares regression. Of the remaining weight matrices, $W_{\mathrm{Filter}}$ is specified and remains fixed, e.g. taken from Overfeat [24]; $W_{\mathrm{Pool}}$ describes standard
average pooling and downsampling; and $W_{\mathrm{in}}$ is set randomly or by using the method of [19] that specifies the weights by sampling examples of the training
distribution, as described in the text. Other variables shown are as follows: $J^2$ is the number of pixels in an image, $L$ is the number of features extracted per
image, $D$ is a downsampling factor, $M$ is the number of hidden units in the classifier stage and $N$ is the number of classes.

An intuitive explanation for the use of LP-pooling is as
follows. First, note that each hidden unit receives as input
a linear combination of a patch of the input data, i.e. $u$ in
$g_1(u)$ has the form $u = \sum_j h_j x_j$, where the $x_j$ are the pixels
in the patch and the $h_j$ are the corresponding filter weights.
Hence, squaring $u$
results in a sum that contains terms proportional to $x_j^2$ and
terms proportional to products of pairs of distinct $x_j$. Thus, squaring is
a simple way to produce hidden layer responses that depend
on the product of pairs of input data elements, i.e. interaction
terms, and this is important for discriminability. Second, the
square root transforms the distribution of the hidden-unit
responses; we have observed that in practice, the result of the
square root operation is often a distribution that is closer to
Gaussian than without it, which helps to regularise the least
squares regression method of training the output weights.
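The interaction-term argument can be written out explicitly. As a sketch, assuming generic filter weights $h_j$ and patch pixels $x_j$ (the index names are illustrative), squaring the linear combination gives:

```latex
u^2 = \Big(\sum_j h_j x_j\Big)^2
    = \sum_j h_j^2 x_j^2 \;+\; 2 \sum_{j < j'} h_j h_{j'}\, x_j x_{j'}
```

The first sum contains the squared-pixel terms and the second contains every pairwise product of distinct pixels in the patch, which are the interaction terms referred to above.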

However, as will be described shortly, the classifier of
Stage 2 also has a square nonlinearity. Using this nonlinearity,
we have found that classification performance is generally
optimised by taking the square root of the input to the random
projection layer. Based on this observation, we do not strictly
use LP-pooling, and instead set

$$g_1(u) = u^2, \qquad (3)$$

$$g_2(v) = v^{0.25}. \qquad (4)$$

This effectively combines the implementation of L2-pooling
and the subsequent square root operation.
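The per-channel steps of Stage 1 with these nonlinearities can be sketched as follows. This is an illustrative reading rather than the original implementation: it assumes the exponent 0.25 for $g_2$ (a square root applied on top of the L2-pooling square root), a pooling kernel whose constant entries are taken as 1, and downsampling by keeping every $D$-th sample starting from the first. The names `conv2d` and `stage1_channel` are hypothetical, and the normalisation step of Algorithm 1 is omitted.

```python
import numpy as np

def conv2d(image, kernel, mode):
    """Square 2-D convolution, 'valid' or 'full', via explicit loops."""
    if mode == "full":
        image = np.pad(image, kernel.shape[0] - 1)   # zero padding on all sides
    W = kernel.shape[0]
    flipped = kernel[::-1, ::-1]
    n_out = image.shape[0] - W + 1
    out = np.empty((n_out, n_out))
    for r in range(n_out):
        for c in range(n_out):
            out[r, c] = np.sum(image[r:r + W, c:c + W] * flipped)
    return out

def stage1_channel(x, h_filter, Q, D):
    """Filter, square, pool, fourth root, downsample for one channel."""
    y = conv2d(x, h_filter, "valid")          # filtering, 'valid' region
    z = y ** 2                                # g1(u) = u^2
    w = conv2d(z, np.ones((Q, Q)), "full")    # summing filter, 'full' region
    s = w ** 0.25                             # g2(v) = v^0.25 (assumed exponent)
    return s[::D, ::D]                        # downsampling phase is an assumption

rng = np.random.default_rng(0)
x = rng.random((28, 28))                      # stand-in for one MNIST-sized image
h = rng.standard_normal((7, 7))
h -= h.mean()                                 # zero-mean filter, as in Section III-B
features = stage1_channel(x, h, Q=8, D=2)
```

Because the squared responses are non-negative and the pooling kernel has non-negative entries, the fourth root is always well defined.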

B. Stage 2 Architecture: Classifier 

The following description applies whether raw pixels are
treated as features or the input is the set of features
extracted in Stage 1. First, we introduce notation. Let:

• $F_{\mathrm{train}}$, of size $L \times K$, contain each length-$L$ feature
vector;

• $Y_{\mathrm{label}}$ be an indicator matrix of size $N \times K$, which
numerically represents the labels of each training
vector, where there are $N$ classes; we set each column
to have a 1 in a single row, corresponding to the label
class for each training vector, and all other entries to
be zero;

• $W_{\mathrm{in}}$, of size $M \times L$, be the real-valued input weights
matrix for the classifier stage;

• $W_{\mathrm{out}}$, of size $N \times M$, be the real-valued output
weights matrix for the classifier stage;

• the function $g(\cdot)$ be the activation function of each
hidden unit; for example, $g(\cdot)$ may be the logistic
sigmoid, $g(z) = 1/(1 + \exp(-z))$, or a squarer,
$g(z) = z^2$;

• $A_{\mathrm{train}} = g(W_{\mathrm{in}} F_{\mathrm{train}})$, of size $M \times K$, contain the
hidden-unit activations that occur due to each feature
vector; $g(\cdot)$ is applied termwise to each element in the
matrix $W_{\mathrm{in}} F_{\mathrm{train}}$.
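The notation above maps directly onto array shapes. The sketch below builds the indicator matrix and the hidden-unit activations for toy dimensions; the function names and the random stand-in data are illustrative assumptions, and the input weights here are plain Gaussian rather than data-dependent.

```python
import numpy as np

def one_hot_labels(labels, n_classes):
    """Indicator matrix of size N x K with a single 1 in each column."""
    Y = np.zeros((n_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

def hidden_activations(W_in, F, g=np.square):
    """A = g(W_in F), size M x K, with g applied termwise."""
    return g(W_in @ F)

rng = np.random.default_rng(0)
L, K, M, N = 20, 50, 30, 4              # toy sizes for the demo only
F_train = rng.random((L, K))            # stand-in for Stage 1 features
labels = rng.integers(0, N, size=K)     # stand-in class labels
W_in = rng.standard_normal((M, L))      # random, untrained input weights
Y_label = one_hot_labels(labels, N)
A_train = hidden_activations(W_in, F_train)
```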

C. Stage 1 Training: Filters and Pooling 

In this paper we do not employ any form of training 
for the filters and pooling matrices. The details of the filter 
weights and form of pooling used for the example classification 
problems presented in this paper are given in Section III.

D. Stage 2 Training: Classifier Weights 

The training approach for the classifier is that described by,
e.g., [6-8,13]. The default situation for these methods is that
the input weights, $W_{\mathrm{in}}$, are generated randomly from a
specific distribution, e.g. standard Gaussian, uniform, or bipolar.
However, it is known that setting these weights non-randomly
based on the training data leads to superior performance [14,
16,19]. In this paper, we use the method of [19]. The input
weights can also be trained iteratively, if desired, using
single-batch backpropagation [17].

Given a choice of $W_{\mathrm{in}}$, the output weights matrix is
determined according to

$$W_{\mathrm{out}} = Y_{\mathrm{label}} A_{\mathrm{train}}^{+}, \qquad (5)$$

where $A_{\mathrm{train}}^{+}$ is the size $K \times M$ Moore-Penrose pseudo-inverse
corresponding to $A_{\mathrm{train}}$. This solution is equivalent to least
squares regression applied to an overcomplete set of linear
equations, with an $N$-dimensional target. It is known to often
be useful to regularise such problems, and instead solve the
following ridge regression problem [12,13]:

$$W_{\mathrm{out}} = Y_{\mathrm{label}} A_{\mathrm{train}}^{\top} (A_{\mathrm{train}} A_{\mathrm{train}}^{\top} + cI)^{-1}, \qquad (6)$$

where $c$ is a hyper-parameter and $I$ is the $M \times M$ identity
matrix. In practice, it is efficient to avoid explicit calculation
of the inverse in Equation (6) [14] and instead use QR
factorisation to solve the following set of $NM$ linear equations
for the $NM$ unknown variables in $W_{\mathrm{out}}$:

$$Y_{\mathrm{label}} A_{\mathrm{train}}^{\top} = W_{\mathrm{out}} (A_{\mathrm{train}} A_{\mathrm{train}}^{\top} + cI). \qquad (7)$$

Above we mentioned two algorithms; Algorithm 2 is simply
to form $A_{\mathrm{train}}$ and solve Equation (7), followed by optimisation
of the ridge regression parameter $c$. For large $M$ and $K > M$ (which
is typically valid) the runtime bottleneck for this method is
typically the $O(KM^2)$ matrix multiplication required to obtain
the Gram matrix, $A_{\mathrm{train}} A_{\mathrm{train}}^{\top}$.
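Solving the linear system for the output weights without forming an inverse can be sketched as below. This is a minimal illustration under stated assumptions: it uses `numpy.linalg.solve` (an LU-based solver) rather than the QR factorisation mentioned in the text, the value of `c` is arbitrary, and the function name `train_output_weights` is hypothetical.

```python
import numpy as np

def train_output_weights(A_train, Y_label, c):
    """Solve W_out (A A^T + c I) = Y A^T for W_out without an explicit inverse."""
    M = A_train.shape[0]
    gram = A_train @ A_train.T + c * np.eye(M)   # M x M, symmetric; O(K M^2) to form
    rhs = Y_label @ A_train.T                    # N x M
    # gram is symmetric, so W_out = rhs gram^{-1} = (gram^{-1} rhs^T)^T
    return np.linalg.solve(gram, rhs.T).T

rng = np.random.default_rng(0)
M, K, N, c = 15, 60, 3, 0.1                      # toy sizes and an arbitrary c
A_train = rng.random((M, K))
labels = rng.integers(0, N, size=K)
Y_label = np.zeros((N, K))
Y_label[labels, np.arange(K)] = 1.0

W_out = train_output_weights(A_train, Y_label, c)
# Reference value from the explicit-inverse form, for comparison only.
W_explicit = Y_label @ A_train.T @ np.linalg.inv(A_train @ A_train.T + c * np.eye(M))
```

The Gram matrix product dominates the cost for $K > M$, which matches the bottleneck described above.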

E. Application to Test Data 

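For a total of $K_{\mathrm{test}}$ test images contained in a matrix $X_{\mathrm{test}}$,
we first obtain a matrix $F_{\mathrm{test}} = g_2(W_{\mathrm{Pool}}\, g_1(W_{\mathrm{Filter}} X_{\mathrm{test}}))$,
of size $L \times K_{\mathrm{test}}$, by following Algorithm 1. The output of
the classifier is then the $N \times K_{\mathrm{test}}$ matrix

$$Y_{\mathrm{test}} = W_{\mathrm{out}}\, g(W_{\mathrm{in}} F_{\mathrm{test}}) \qquad (8)$$

$$\phantom{Y_{\mathrm{test}}} = W_{\mathrm{out}}\, g(W_{\mathrm{in}}\, g_2(W_{\mathrm{Pool}}\, g_1(W_{\mathrm{Filter}} X_{\mathrm{test}}))). \qquad (9)$$

Note that we can write the response to all test images in terms
of the training data:

$$Y_{\mathrm{test}} = Y_{\mathrm{label}}\, (g(W_{\mathrm{in}} F_{\mathrm{train}}))^{+}\, g(W_{\mathrm{in}} F_{\mathrm{test}}), \qquad (10)$$

where

$$F_{\mathrm{train}} = g_2(W_{\mathrm{Pool}}\, g_1(W_{\mathrm{Filter}} X_{\mathrm{train}})), \qquad (11)$$

$$F_{\mathrm{test}} = g_2(W_{\mathrm{Pool}}\, g_1(W_{\mathrm{Filter}} X_{\mathrm{test}})). \qquad (12)$$

Thus, since the pseudo-inverse, $(\cdot)^{+}$, can be obtained from
Equation (6), Equations (10), (11) and (12) constitute a closed-form
solution for the entire test-data classification output, given
specified matrices $W_{\mathrm{Filter}}$, $W_{\mathrm{Pool}}$ and $W_{\mathrm{in}}$, and hidden-unit
activation functions $g_1$, $g_2$ and $g$.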
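The closed-form expression for the test output can be checked on toy data. The sketch below assumes the unregularised pseudo-inverse (via `numpy.linalg.pinv`) rather than the ridge form, uses random stand-ins for the Stage 1 features, and takes the column-wise maximum as the predicted class; all variable values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, K_test, M, N = 12, 40, 7, 10, 3     # toy sizes for the demo only
g = np.square                             # squaring hidden-unit nonlinearity

F_train = rng.random((L, K))              # stand-in training features
F_test = rng.random((L, K_test))          # stand-in test features
labels = rng.integers(0, N, size=K)
Y_label = np.zeros((N, K))
Y_label[labels, np.arange(K)] = 1.0
W_in = rng.standard_normal((M, L))

A_train = g(W_in @ F_train)
W_out = Y_label @ np.linalg.pinv(A_train)            # output weights from training data
Y_test = W_out @ g(W_in @ F_test)                    # classifier output, N x K_test
Y_closed = Y_label @ np.linalg.pinv(g(W_in @ F_train)) @ g(W_in @ F_test)
predicted = np.argmax(Y_test, axis=0)                # one class index per test image
```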

The final classification decision for each image is obtained
by taking the index of the maximum value of each column of
$Y_{\mathrm{test}}$.

III. Image Classification Experiments: Specific 
Design 

We examined the method's performance when used as a
classifier of images. Table I lists the attributes of the four
well-known databases we used. For the two databases composed
of RGB images, we used $C = 4$ channels, namely the raw
RGB channels and a conversion to greyscale. This approach
was shown to be effective for SVHN in [26].


Database           Classes   Training   Test     Channels     Pixels
MNIST [20]         10        60000      10000    1            28x28
NORB-small [23]    5         24300      24300    2 (stereo)   32x32
SVHN [22]          10        604308     26032    3 (RGB)      32x32
CIFAR-10 [21]      10        50000      10000    3 (RGB)      32x32

TABLE I. Image Databases. Note that the NORB-small database
consists of images of size 96 x 96 pixels, but we first downsampled all
training and test images to 32 x 32 pixels, as in [5].

A. Preprocessing

All raw image pixel values were scaled to the interval
[0,1]. Due to the use of quadratic nonlinearities and
LP-pooling, this scaling does not affect performance. The only
other preprocessing done was as follows:

1) MNIST: none;

2) NORB-small: downsample from 96x96 to 32x32,
for implementation efficiency reasons (this is
consistent with some previous work on NORB-small,
e.g. [5]);

3) SVHN: convert from 3 channels to 4 by adding a
conversion to greyscale from the raw RGB. We found
that local and/or global contrast enhancement only
diminished performance;

4) CIFAR-10: convert from 3 channels to 4 by adding
a conversion to greyscale from the raw RGB; apply
ZCA whitening to each channel of each image, as
in [3].

B. Stage 1 Design: Filters and Pooling

Since our objective here was to train only a single layer of
the network, we did not seek to train the network to find filters
optimised for the training set. Instead, for the size $W \times W$
two-dimensional filters, $h_{i,c}$, we considered the following options:

1) simple rotated bar and corner filters, and square
uniform centre-surround filters;

2) filters trained on Imagenet and made available in
Overfeat [24]; we used only the 96 stage-1 'accurate'
7x7 filters;

3) patches obtained from the central $W \times W$ region
of randomly selected training images, with $P/N$
training images from each class.

The filters from Overfeat¹ are RGB filters. Hence, for the
databases with RGB images, we applied each channel of
the filter to the corresponding channel of each image. When
applied to greyscale channels, we converted the Overfeat filter
to greyscale. For NORB, we applied the same filter to both
stereo channels. For all filters, we subtract the mean value
over all dimensions in each channel, in order to ensure a
mean of zero in each channel.

In implementing the two-dimensional convolution
operation required for filtering the raw images using $h_{i,c}$, we
obtained only the central 'valid' region, i.e. for images of size
$J \times J$, the total dimension of the valid region is $(J - W + 1)^2$.
Consequently, the total number of features per image obtained
prior to pooling, from $P$ filters and images with $C$ channels,
is $\hat{L} = CP(J - W + 1)^2$.

In previous work, e.g. [25], the form of the $Q \times Q$
two-dimensional filter, $h_p$, is a normalised Gaussian. Instead, we used
a simple summing filter, equivalent to a kernel with all entries
equal to the same value, i.e.

$$h_{p,u,v} = 1, \qquad u = 1, \ldots, Q, \; v = 1, \ldots, Q. \qquad (13)$$

In implementing the two-dimensional convolution operation
required for filtering using $h_p$, we obtained the 'full'
convolutional region, which for images of size $J \times J$ is $(J - W + Q)^2$,
given the 'valid' convolution first applied using $h_{i,c}$, as
described above.

The remaining part of the pooling step is to downsample
each image dimension by a factor of $D$, resulting in a total
of $L = \hat{L}/D^2$ features per image. In choosing $D$, we
experimented with a variety of scales before settling on the values
shown in Table II. We note there exists an interesting tradeoff
between the number of filters, $P$, and the downsampling factor,
$D$. For example, in [3], $D = L/2$, whereas in [5] $D = 1$. We
found that, up to a point, smaller $D$ enables a smaller number
of filters, $P$, for comparable performance.

The hyper-parameters we used for each dataset are shown
in Table II.

¹ Available from http://cilvr.nyu.edu/doku.php?id=software:overfeat:start


Hyper-parameter          MNIST   NORB   SVHN   CIFAR-10
Filter size, W           7       7      7      7
Pooling size, Q          8       10     7      7
Downsample factor, D     2       2      5      3

TABLE II. Stage 1 Hyper-parameters (Convolutional Feature
Extraction).
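The size expressions for the 'valid' and 'full' convolution regions can be evaluated for a concrete case. The sketch below is a small worked example using the MNIST image size and the MNIST filter and pooling sizes listed in the tables, together with $P = 60$ filters as reported for the best MNIST result; the function name `stage1_sizes` is illustrative.

```python
def stage1_sizes(J, W, Q, C, P):
    """Side lengths after each convolution and the pre-pooling feature count."""
    valid_side = J - W + 1                  # 'valid' filtering: (J - W + 1) per side
    full_side = valid_side + Q - 1          # 'full' pooling: (J - W + Q) per side
    features_before_pooling = C * P * valid_side ** 2
    return valid_side, full_side, features_before_pooling

# MNIST: 28x28 images, 1 channel, W = 7, Q = 8, and P = 60 filters.
valid_side, full_side, n_features = stage1_sizes(J=28, W=7, Q=8, C=1, P=60)
```

For these values the valid region is 22 x 22, the pooled region is 29 x 29, and there are 29040 features per image before pooling and downsampling.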


C. Stage 2 Design: Classifier projection weights 

To construct the matrix Win we use the method proposed 
by [18]. In this method, each row of the matrix Win is 
chosen to be a normalized difference between the data vectors 
corresponding to randomly chosen examples from distinct 
classes of the training set. This method has previously been 
shown to be superior to setting the weights to values chosen 
from random distributions [14,18]. 
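The construction of the input weights from class differences can be sketched as follows. This is an illustrative reading of the description above, not the referenced method's exact procedure: it assumes that "normalized" means division by the Euclidean norm and that pairs are drawn uniformly at random with replacement; the function name `difference_input_weights` is hypothetical.

```python
import numpy as np

def difference_input_weights(F_train, labels, M, rng):
    """Each row is a unit-norm difference of two examples from distinct classes."""
    L, K = F_train.shape
    W_in = np.empty((M, L))
    for m in range(M):
        while True:
            i, j = rng.integers(0, K, size=2)
            if labels[i] == labels[j]:
                continue                      # need examples from distinct classes
            diff = F_train[:, i] - F_train[:, j]
            norm = np.linalg.norm(diff)
            if norm > 0:
                break                         # skip degenerate identical vectors
        W_in[m] = diff / norm
    return W_in

rng = np.random.default_rng(0)
L, K, M, N = 16, 100, 25, 5                   # toy sizes for the demo only
F_train = rng.random((L, K))                  # stand-in feature vectors
labels = rng.integers(0, N, size=K)
W_in = difference_input_weights(F_train, labels, M, rng)
row_norms = np.linalg.norm(W_in, axis=1)
```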

For the nonlinearity in the classifier stage hidden units,
$g(z)$, the typical choice in other work [13] is a sigmoid.
However, we found it sufficient (and much faster in an
implementation) to use the quadratic nonlinearity. This suggests
that good image classification is strongly dependent on the
presence of interaction terms; see the discussion about this in
Section II-A.

D. Stage 2 Design: Ridge Regression parameter 

With these choices, there remain only two
hyper-parameters for the classifier stage: the regression parameter,
$c$, and the number of hidden units, $M$. In our experiments,
we examined classification error rates as a function of varying
$M$. For each $M$, we can optimize $c$ using cross-validation.
However, we also found that a good generic heuristic for
setting $c$ was
$$c = \frac{N^2}{M^2} \min(\mathrm{diag}(A_{\mathrm{train}} A_{\mathrm{train}}^{\top})), \qquad (14)$$

and this reduces the number of hyper-parameters for the
classification stage to just one: the number of hidden units, $M$.

E. Stage 1 and 2 Design: Nonlinearities 

For the hidden-layer nonlinearities, to reiterate, we use:

$$g_1(u) = u^2, \qquad g_2(v) = v^{0.25}, \qquad g(z) = z^2. \qquad (15)$$

IV. Results 

We examined the performance of the network on
classifying the test images in the four chosen databases, as a function
of the number of filters, $P$, the downsampling rate, $D$, and the
number of hidden units in the classifier stage, $M$. We use the
maximum number of channels, $C$, available in each dataset
(recall from above that we convert RGB images to greyscale,
as a fourth channel).

We considered the three kinds of untuned filters described 
in Section III-B, as well as combinations of them. We did not 
exhaustively consider all options, but settled on the Overfeat 
filters as being marginally superior for NORB, SVHN and 
CIFAR-10 (in the order of 1% in comparison with other 
options), while hand-designed filters were superior for MNIST, 
but only marginally compared to randomly selected patches 
from the training data. There is clearly more that can be 
investigated to determine whether hand-designed filters can 
match trained filters when using the method of this paper. 

A. Summary of best performance attained 

The best performance we achieved is summarised in
Table III.

Database       C   M       P    Our best   State-of-the-art
MNIST          1   12000   60   0.37%      0.39% [27,28]
NORB-small     2   3200    60   2.21%      2.53% [29]
SVHN           4   40000   96   3.96%      1.92% [28]
CIFAR-10       4   40000   96   24.14%     9.78% [28]

TABLE III. Results for various databases. The state-of-the-art result
listed for MNIST and CIFAR-10 can be improved by augmenting the
training set with distortions and other methods [30-32]; we have not
done so here, and report state-of-the-art only for methods not doing so.


B. Trend with increasing M 

We now use MNIST as an example to indicate how
classification performance scales with the number of hidden units in
the classifier stage, $M$. The remaining parameters were $W = 7$,
$D = 3$ and $P = 43$, which included hand-designed filters
composed of 20 rotated bars (width of one pixel), 20 rotated
corners (dimension 4 pixels) and 3 centred squares (dimensions
3, 4 and 5 pixels), all with zero mean. The rotations were
of binary filters and used standard pixel value interpolation.
Figure 2 shows a power law-like decrease in error rate as $M$
increases, with a linear trend on the log-log axes. The best error
rate shown on this figure is 0.40%. As shown in Table III,
we have attained a best repeatable rate of 0.37% using 60
filters and $D = 2$. When we combined Overfeat filters with
hand-designed filters and randomly selected patches from the
training data, we obtained up to 0.32% error on MNIST, but
this was an outlier since it was not repeatedly obtained by
different samples of $W_{\mathrm{in}}$.

C. Indicative training times 

For an implementation in Matlab on a PC with 4 cores and
32 GB of RAM, for MNIST (60000 training points) the total
time required to generate all features for all 60000 training
images from one filter is approximately 2 seconds. The largest
number of filters we used to date was 384 (96 RGB+greyscale),
and when applied to SVHN (≈600000 training points), the
total run time for feature extraction is then about two hours
(in this case we used batches of size 100000 images).

The runtime we achieve for feature generation benefits
from carrying out convolutions using matrix multiplication
applied to large batches simultaneously; if instead we iterate over
all training images individually, but still carry out convolutions
using matrix multiplication, the time for generating features
approximately doubles. Note also that we employ Matlab's
sparse matrix data structure functionality to represent $W_{\mathrm{Filter}}$
and $W_{\mathrm{Pool}}$, which also provides a speed boost when
multiplying these matrices to carry out the convolutions. If we do
not use the matrix-multiplication method for convolution, and
instead apply two-dimensional convolutions to each individual
image, the feature generation is slowed even more.

Fig. 2. Example set of error percentage values on the 10000 MNIST test
images, for ten repetitions of the selection of $W_{\mathrm{in}}$. The best result shown is 40
errors out of 10000. Performance saturates as $M$ increases above 6400.

For the classifier stage, on MNIST with $M = 6400$, the
runtime is approximately 150 seconds for $D = 3$ (there is a
small time penalty for smaller $D$, due to the larger dimension
of the input to the classifier stage). Hence, the total run time
for MNIST with 40 filters and $M = 6400$ is in the order of 4
minutes to achieve a correct classification rate above 99.5%.
With fewer filters and smaller $M$, it is simple to achieve over
99.2% in a minute or less.

For SVHN and CIFAR-10, where we scaled up to $M = 40000$,
the run time bottleneck is the classifier, due to the
$O(KM^2)$ runtime complexity. We found it necessary to use a
PC with more RAM (peak usage was approximately 70 GB)
for $M > 20000$. In the case of $M = 40000$, the network
was trained in under an hour on CIFAR-10, while SVHN took
about 8-9 hours. Results within a few percent of our best,
however, can be obtained in far less time.
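A rough back-of-envelope count illustrates why memory and runtime grow quickly with $M$. The sketch below assumes double-precision storage (8 bytes per value) and, for the activations figure, that the full $M \times K$ matrix is held in memory at once; neither assumption is stated in the text, and the numbers are not intended to reproduce the reported peak usage.

```python
def classifier_cost(M, K, bytes_per_value=8):
    """Storage for the Gram and activation matrices, and Gram multiply-adds."""
    gram_bytes = M * M * bytes_per_value          # M x M Gram matrix
    activations_bytes = M * K * bytes_per_value   # M x K hidden-unit activations
    multiply_adds = K * M * M                     # cost of forming the Gram matrix
    return gram_bytes, activations_bytes, multiply_adds

# CIFAR-10 scale: M = 40000 hidden units, K = 50000 training images.
gram_bytes, activations_bytes, multiply_adds = classifier_cost(M=40000, K=50000)
gram_gb = gram_bytes / 1e9
activations_gb = activations_bytes / 1e9
```

Under these assumptions the Gram matrix alone occupies 12.8 GB and the activations 16 GB, and doubling $M$ quadruples both the Gram storage and the multiply-add count.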

V. Discussion and Conclusions 

As stated in the introduction, the purpose of this paper
is to highlight the potential benefits of the method presented,
namely that it can attain excellent results with a rapid training
speed and low implementation complexity, whilst only
suffering from reduced performance relative to state-of-the-art on
particularly hard problems.

In terms of efficacy on classification tasks, as shown in
Table III, our best result (0.37% error rate) surpasses the best
ever reported performance for classification of the MNIST test
set when no augmentation of the training set is done. We have
also achieved, to our knowledge, the best performance reported
in the literature for the NORB-small database, surpassing the
previous best [29] by about 0.3%.

For SVHN, our best result is within ≈2% of state-of-the-
art. It is highly likely that using filters trained on the SVHN 
database rather than on Imagenet would reduce this gap, given 
the structured nature of digits, as opposed to the more complex 
nature of Imagenet images. Another avenue for closing the gap 
on state-of-the-art using the same filters would be to increase 
M and decrease D, thus resulting in more features and more 
classifier hidden units. Although we increased M to 40000, 
we did not observe saturation in the error rate as we increased 
M to this point. 

For CIFAR-10, it is less clear what is lacking in our method,
given the gap of about 14% to state-of-the-art
methods. We note that CIFAR-10 has relatively few training
points, and we observed that the gap between classification
performance on the actual training set, in comparison with the test
set, can be up to 20%. This suggests that designing enhanced
methods of regularisation (e.g. methods similar to dropout in
the convolutional stage, or data augmentation) is necessary to
ensure our method can achieve good performance on CIFAR-10.
Another possibility is to use a nonlinearity in the classifier
stage that ensures the hidden-layer responses reflect higher
order correlations than possible from the squaring function
we used. However, we expect that training the convolutional
filters in Stage 1 so that they extract features that are more
discriminative for the specific dataset will be the most likely
enhancement for improving results on CIFAR-10.

Finally, we note that there exist iterative approaches for 
training the classifier component of Stage 2 using least 
squares regression, and without training the input weights— 
see, e.g., [14,33,34]. These methods can be easily adapted for 
use with the convolutional front-end, if, for example, additional 
batches of training data become available, or if the problem 
involves online learning. 
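One simple way to absorb additional batches is to accumulate the two matrix products that the ridge solution depends on. The sketch below illustrates this idea only; it is not necessarily the iterative approach of the cited references, and the class name `IncrementalRidge` and the toy data are assumptions made for the example.

```python
import numpy as np

class IncrementalRidge:
    """Accumulates A A^T + c I and Y A^T over batches of training data."""

    def __init__(self, M, N, c):
        self.gram = c * np.eye(M)          # running A A^T + c I
        self.cross = np.zeros((N, M))      # running Y A^T

    def update(self, A_batch, Y_batch):
        self.gram += A_batch @ A_batch.T
        self.cross += Y_batch @ A_batch.T

    def weights(self):
        # gram is symmetric, so W_out = cross gram^{-1} = (gram^{-1} cross^T)^T
        return np.linalg.solve(self.gram, self.cross.T).T

rng = np.random.default_rng(0)
M, K, N, c = 12, 90, 3, 0.5                # toy sizes and an arbitrary c
A = rng.random((M, K))                     # stand-in hidden-unit activations
labels = rng.integers(0, N, size=K)
Y = np.zeros((N, K))
Y[labels, np.arange(K)] = 1.0

model = IncrementalRidge(M, N, c)
for start in range(0, K, 30):              # three batches of 30 columns each
    model.update(A[:, start:start + 30], Y[:, start:start + 30])
W_incremental = model.weights()
W_batch = np.linalg.solve(A @ A.T + c * np.eye(M), (Y @ A.T).T).T
```

Because both accumulated quantities are sums over training points, the result after all batches matches the single-batch solution, while only $M \times M$ and $N \times M$ matrices need to be kept between updates.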

In closing, following acceptance of this paper, we became
aware of a newly published paper that combines convolutional
feature extraction with least squares regression training of
classifier weights to obtain good results for the NORB dataset [35].
The three main differences between the method of the current
paper and the method of [35] are as follows. First, we used
a hidden layer in our classifier stage, whereas [35] solves for
output weights using least squares regression applied to the
output of the pooling stage. Second, we used a variety of
methods for the convolutional filter weights, whereas [35] uses
orthogonalised random weights only. Third, we downsample
following pooling, whereas [35] does not do so.